
    Seasonal Decomposition for Geographical Time Series using Nonparametric Regression

    Get PDF
    A time series often contains systematic effects such as trend and seasonality. These components can be identified and separated by decomposition methods. In this thesis, we discuss a time series decomposition process using nonparametric regression. A method based on both loess and harmonic regression is suggested, and an optimal model selection method is discussed. We then compare the process with seasonal-trend decomposition by loess (STL; Cleveland et al., 1990). While STL works well when proper parameters are used, the method we introduce is competitive and makes parameter choice more automatic and less complex. Decomposition methods often require that the time series be evenly spaced, so any missing values must first be estimated. For time series with seasonality, it is preferable to use the seasonal information to estimate missing observations. The seasonal adjustment algorithm (McLeod et al., 1983) can be used for monthly time series. In this thesis, we examine the algorithm and extend it to daily data.
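    The harmonic-regression part of the seasonal estimate can be sketched in a few lines of numpy. The synthetic series, the period of 12, and the use of two harmonics are illustrative assumptions; in the full method a loess fit would remove the trend before the seasonal component is estimated.

    ```python
    import numpy as np

    # Estimate the seasonal component by harmonic regression: regress the
    # series on sine/cosine terms at the seasonal frequency and its harmonics.
    def harmonic_seasonal(series, period=12, n_harmonics=2):
        t = np.arange(len(series))
        cols = [np.ones(len(series))]
        for k in range(1, n_harmonics + 1):
            cols.append(np.sin(2 * np.pi * k * t / period))
            cols.append(np.cos(2 * np.pi * k * t / period))
        X = np.column_stack(cols)
        beta, *_ = np.linalg.lstsq(X, series, rcond=None)
        # drop the intercept so only the periodic part is returned
        return X[:, 1:] @ beta[1:]

    rng = np.random.default_rng(0)
    t = np.arange(240)                                  # twenty years of monthly data
    true_seasonal = 2.0 * np.sin(2 * np.pi * t / 12)
    series = true_seasonal + rng.normal(scale=0.3, size=240)
    est = harmonic_seasonal(series)
    print(np.max(np.abs(est - true_seasonal)))          # small estimation error
    ```

    Because the sine/cosine columns form a fixed design matrix, the seasonal fit is an ordinary least-squares problem with no smoothing parameter to tune, which is one way the approach reduces the parameter choices required by STL.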

    Nearest Labelset Using Double Distances for Multi-label Classification

    Full text link
    Multi-label classification is a type of supervised learning where an instance may belong to multiple labels simultaneously. Predicting each label independently has been criticized for not exploiting correlations between labels. In this paper we propose a novel approach, Nearest Labelset using Double Distances (NLDD), that predicts the labelset observed in the training data that minimizes a weighted sum of the distances to the new instance in both the feature space and the label space. The weights specify the relative tradeoff between the two distances and are estimated from a binomial regression of the number of misclassified labels as a function of the two distances, with model parameters estimated by maximum likelihood. NLDD only considers labelsets observed in the training data, thus implicitly taking label dependencies into account. Experiments on benchmark multi-label data sets show that the proposed method on average outperforms other well-known approaches in terms of Hamming loss, 0/1 loss, and multi-label accuracy, and ranks second after ECC on the F-measure.
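    The selection rule above can be sketched schematically. Assumed for illustration: Euclidean distances in both spaces, a fixed weight `w`, and a preliminary label-probability estimate `p_hat` for the new instance; in the paper the weights come from the binomial regression and the probabilities from per-label models.

    ```python
    import numpy as np

    # Schematic NLDD prediction: among the labelsets observed in training,
    # pick the one whose training instance minimizes a weighted sum of the
    # feature-space and label-space distances to the new instance.
    def nldd_predict(X_train, Y_train, x_new, p_hat, w=0.5):
        d_x = np.linalg.norm(X_train - x_new, axis=1)   # feature-space distance
        d_y = np.linalg.norm(Y_train - p_hat, axis=1)   # label-space distance
        best = np.argmin(w * d_x + (1 - w) * d_y)
        return Y_train[best]                            # an observed labelset

    X_train = np.array([[0.0, 0.0], [5.0, 5.0]])
    Y_train = np.array([[1, 0], [0, 1]])                # observed labelsets
    pred = nldd_predict(X_train, Y_train,
                        x_new=np.array([0.2, 0.1]),
                        p_hat=np.array([0.9, 0.1]))
    print(pred)
    ```

    Returning an entire observed labelset, rather than thresholding each label separately, is what lets the method respect label correlations without modeling them explicitly.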

    Automated classification for open-ended questions with BERT

    Full text link
    Manual coding of text data from open-ended questions into different categories is time consuming and expensive. Automated coding uses statistical/machine learning to train on a small subset of manually coded text answers. Recently, pre-training a general language model on vast amounts of unrelated data and then adapting the model to the specific application has proven effective in natural language processing. Using two data sets, we empirically investigate whether BERT, the currently dominant pre-trained language model, is more effective at automated coding of answers to open-ended questions than non-pre-trained statistical learning approaches. First, we found that fine-tuning the pre-trained BERT parameters is essential, as otherwise BERT is not competitive. Second, we found that fine-tuned BERT barely beats the non-pre-trained statistical learning approaches in terms of classification accuracy when trained on 100 manually coded observations. However, BERT's relative advantage increases rapidly when more manually coded observations (e.g. 200-400) are available for training. We conclude that for automatically coding answers to open-ended questions BERT is preferable to non-pre-trained models such as support vector machines and boosting.
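    A non-pre-trained baseline of the kind the paper compares against can be sketched with scikit-learn: TF-IDF features feeding a linear SVM that maps open-ended answers to categories. The toy answers and category labels below are invented purely for illustration.

    ```python
    # A minimal sketch of a non-pre-trained baseline: TF-IDF + linear SVM.
    # The answers and categories are invented toy data, not from the paper.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    answers = [
        "I teach mathematics at a secondary school",
        "I drive a delivery truck for a logistics firm",
        "I give lectures and grade exams at a high school",
        "Long-haul trucking, mostly freight",
    ]
    codes = ["education", "transport", "education", "transport"]

    # Unigrams plus bigrams give the SVM some phrase-level signal.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    model.fit(answers, codes)
    print(model.predict(["I teach biology classes at school"]))
    ```

    Such a model has no pre-trained knowledge of language: every feature must be learned from the manually coded training answers, which is consistent with the finding that its advantage fades once a few hundred coded observations are available to fine-tune BERT.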

    Statistical Learning Approaches to Some Classification Problems

    Get PDF
    Classification is essential in statistical learning. This thesis deals with three topics in classification: multi-label classification, nonparametric multi-class classification, and a special type of text categorization called occupation coding. For each topic, novel approaches are proposed with the goal of high predictive performance, which is empirically demonstrated for each method. In multi-label classification, observations may be associated with multiple classes or labels simultaneously. Correlations generally exist between labels, and taking them into account is important during the classification process. This thesis proposes an approach based on the nearest neighbor principle that considers neighbors in both the feature (x) and the label (y) space: the proposed method chooses the labelset of a training observation that minimizes a weighted function of the distances in feature and label space. By selecting an entire labelset as the prediction, the method implicitly considers label correlations. In multi-class classification, the well-known k-nearest neighbors method is especially desirable when the response surface exhibits highly local behavior. A novel approach is presented that makes a prediction based on the k-th nearest neighbor from each class. The method not only provides estimates of class posterior probabilities but also converges to the Bayes classifier as the size of the training data increases. Further, the method is extended using the idea of an ensemble. Occupation coding is an important multi-class text categorization problem. Since fully automated classification is challenging, researchers focus more on partially automated coding. Three approaches are proposed that improve the classification accuracy of the underlying statistical learning methods.

    The k conditional nearest neighbor algorithm for classification and class probability estimation

    Get PDF
    The k nearest neighbor (kNN) approach is a simple and effective nonparametric algorithm for classification. One of the drawbacks of kNN is that the method can only give coarse estimates of class probabilities, particularly for low values of k. To avoid this drawback, we propose the k conditional nearest neighbor (kCNN) approach, a new nonparametric classification method based on nearest neighbors conditional on each class: the proposed approach calculates the distance between a new instance and the kth nearest neighbor from each class, estimates posterior probabilities of class membership using the distances, and assigns the instance to the class with the largest posterior. We prove that the proposed approach converges to the Bayes classifier as the size of the training data increases. Further, we extend the proposed approach to an ensemble method. Experiments on benchmark data sets show that both the proposed approach and its ensemble version on average outperform kNN, weighted kNN, probabilistic kNN, and two similar algorithms (LMkNN and MLM-kHNN) in terms of the error rate. A simulation shows that kCNN may be useful for estimating posterior probabilities when the class distributions overlap.
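    The per-class distance computation described above can be sketched as follows. The scoring rule (inverse distance to the k-th within-class neighbor, raised to the feature dimension, as in a density estimate over a ball of that radius) is a simplified density-motivated choice for illustration, not necessarily the paper's exact estimator.

    ```python
    import numpy as np

    # Simplified kCNN sketch: distance to the k-th nearest neighbor within
    # each class, turned into density-style scores and normalized into
    # posterior probability estimates.
    def kcnn_posteriors(X, y, x_new, k=3):
        classes = np.unique(y)
        p = X.shape[1]                                   # feature dimension
        scores = np.empty(len(classes))
        for i, c in enumerate(classes):
            Xc = X[y == c]
            # distance from x_new to the k-th nearest neighbor in class c
            d_k = np.sort(np.linalg.norm(Xc - x_new, axis=1))[k - 1]
            # density-style score ~ inverse volume of a ball of radius d_k;
            # the small floor guards against x_new coinciding with a point
            scores[i] = 1.0 / max(d_k, 1e-12) ** p
        probs = scores / scores.sum()
        return dict(zip(classes, probs))

    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
    y = np.array([0, 0, 0, 1, 1, 1])
    probs = kcnn_posteriors(X, y, np.array([0.5, 0.5]), k=2)
    print(probs)   # a point near the first cluster gets most mass for class 0
    ```

    Unlike plain kNN, whose probability estimates are multiples of 1/k, the distances here vary continuously, which is what allows finer-grained posterior estimates even for small k.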

    Three Methods for Occupation Coding Based on Statistical Learning

    Get PDF
    Occupation coding, an important task in official statistics, refers to coding a respondent's text answer into one of many hundreds of occupation codes. To date, occupation coding is still at least partially conducted manually, at great expense. We propose three methods for automatic coding: combining separate models for the detailed and aggregate occupation codes, a hybrid method that combines a duplicate-based approach with a statistical learning algorithm, and a modified nearest neighbor approach. Using data from the German General Social Survey (ALLBUS), we show that the proposed methods improve on both the coding accuracy of the underlying statistical learning algorithm and the coding accuracy of duplicates where duplicates exist. Further, we find that defining duplicates based on n-gram variables (a concept from text mining) is preferable to defining them based on exact string matches.
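    The hybrid idea can be sketched as a lookup with a fallback: answers that duplicate a training answer receive the majority code among those duplicates, and everything else is passed to a statistical learning model. The example answers, the occupation codes, and the stubbed fallback are invented for illustration; the paper's preferred variant would match on n-gram variables rather than the exact normalized strings used here.

    ```python
    from collections import Counter, defaultdict

    # Build a table mapping each normalized training answer to the codes
    # it received, so duplicates can be coded by majority vote.
    def build_duplicate_table(answers, codes):
        table = defaultdict(Counter)
        for a, c in zip(answers, codes):
            table[a.strip().lower()][c] += 1
        return table

    # Hybrid coding: use the duplicate table when the answer was seen in
    # training, otherwise fall back to a statistical learning model.
    def hybrid_code(answer, table, fallback):
        key = answer.strip().lower()
        if key in table:
            return table[key].most_common(1)[0][0]   # majority code among duplicates
        return fallback(answer)                      # e.g. an SVM or boosting model

    train_answers = ["teacher", "Teacher", "truck driver"]
    train_codes = ["2330", "2330", "8332"]           # invented code values
    table = build_duplicate_table(train_answers, train_codes)
    print(hybrid_code("TEACHER", table, fallback=lambda a: "0000"))
    ```

    The division of labor reflects the empirical finding: duplicates are coded very reliably by lookup, so the statistical learner only has to handle the genuinely novel answers.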